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Summary. Network data often take the form of repeated interactions between senders and re- 
ceivers tabulated over time. A primary question to ask of such data is which traits and behaviors 
are predictive of interaction. To answer this question, a model is introduced for treating directed 
interactions as a multivariate point process: a Cox multiplicative intensity model using covariates 
that depend on the history of the process. Consistency and asymptotic normality are proved for 
the resulting partial-likelihood-based estimators under suitable regularity conditions, and an efficient 
fitting procedure is described. Multicast interactions — those involving a single sender but multiple 
receivers — are treated explicitly. The resulting inferential framework is then employed to model mes- 
sage sending behavior in a corporate e-mail network. The analysis gives a precise quantification of 
which static shared traits and dynamic network effects are predictive of message recipient selection. 
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1. Introduction 



Much effort has been devoted to the statistical analysis of network data; see Jackson (2008), 



Goldenberg et al. (2009 1, and Kolaczyk (2009) for recent overviews. Often network observables 



comprise counts of interactions between individuals or groups tabulated over time. Communica- 
tions networks give rise to directed interactions: phone calls, text messages, or e-mails exchanged 



amongst a given set of individuals over a specific time period (Tyler et al. 2005: Eagle and Pent 



land 2006). Specific examples of repeated interactions from other types of networks include the 



following: Fowler s (2006) study of legislators authoring and cosponsoring bills (a collaboration 
network); Mckenzie and Rapoport s (2007 1 study of families migrating between communities in 
Mexico (a migration network); Sundaresan, Fischoff, Dushoff, and Rubenstein s (2007) study of 
zebras congregating at locations in their habitat (an animal association network); and Papachris- 
|tos| s ( |2009| study of gangs in Chicago murdering members of rival factions (an organized crime 
network) . 

In this article, we consider partial-likelihood-based inference for general directed interaction 
data in the presence of covariates. We first develop asymptotic theory for the case in which inter- 
actions are strictly pairwise, and then generalize our results to the multiple-receiver (multicast) 
case; we also provide efficient algorithms for partial likelihood maximization in these settings. 
Our main assumption on the covariates is that they be predictable, which allows them to vary 
with time and potentially depend on the past. 
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The interaction data we consider comprise a set of triples, with triple (t,i,j) indicating that 
at time t, directed interaction i — > j took place — for instance, individual i sent a message to 
individual j. Given such a set of triples, a primary modeling goal lies in determining which 
characteristics and behaviors of the senders and receivers are predictive of interaction. In this 
vein, three important questions stand out: 

Homophily Is there evidence of homophily (an increased rate of interaction among similar 
individuals)? To what degree is a shared attribute predictive of heightened interaction? 

Network Effects To what extent are past interaction behaviors predictive of future ones? If 
we observe interactions i — > h and h — > j, are we more likely to see the interaction i — > j? 

Multiplicity How should multiple-receiver interactions of the type i — > {ii,j2, ■ • • , jh} be mod- 
eled? What are the implications of treating these as L separate pairwise interactions? 

The issues of homophily, network effects, and their interactions arise frequently in the net- 



works literature; see, e.g., McPherson et al. (2001); Butts (2008); Aral et al. (2009); Snijders et al 



(2010), and references contained therein. Multiplicity has largely been ignored in this context, 



Shafiei and Chipman ( 2010 ) for network modeling 



however, with notable exceptions including Lunagomez et al. (2009) for graphical models, and 



In the remainder of this article, we provide a modeling framework and computationally effi- 
cient partial likelihood inference procedures to facilitate analysis of these questions. We employ 
a Cox proportional intensity model incorporating both static and history-dependent covariates 
to address the first of these two questions, and a parametric bootstrap to address the third. Sec- 
tion[2]presents our point process model for directed pairwise interactions, along with the resultant 
inference procedures. Section [3] proves consistency and asymptotic normality of the correspond- 
ing maximum partial likelihood estimator, and Section [4] extends our framework to the case of 
multiple-receiver interactions. Section [5] employs this framework to model message sending be- 
havior in a corporate e-mail network. Section [6] evaluates the strength of homophily and network 
effects in explaining these data, and discusses the results of three comparative analyses. Section [7] 
concludes the main body of the article, and Appendices [A[|C| contain implementation details and 
technical lemmas. 



2. A point process model and partial likelihood inference 



Every interaction process can be encoded by a multivariate counting measure. For sender i, 
receiver j, and positive time t, define 

Nt(i,j) = ^{directed interactions i — ¥ j in time interval [0, f]}. 

For technical reasons, assume that No(i, j) = and that N t (i, j) is adapted to a stochastic basis of 
cr-algebras {J-t}t>o satisfying the usual conditions. Then, N t (i,j) is a local submartingale, so by 
the Doob-Meyer decomposition, there exists a predictable increasing process A t (i,j), null at zero, 
such that N t (i, j) — A t (i, j) is an J^-local martingale. Under mild conditions — the most important 
of which is that no two interactions happen simultaneously — there exists a predictable continuous 



process \ t {i,j) such that A t (i,j) 
exist and are an annoyance; 



X s (i,j)ds. (In practical applications, simultaneous events 



Efron (U977) handles simultaneity through an ad-hoc adjustment, 



while Brostrom ( 2002 ) adds a discrete component to A.) The process A is known as the stochastic 
intensity of N. Heuristically, 



At(i, j) dt = Pjinteraction i —± j occurs in time interval [t, t + dt)}. 
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We will model N through A using a version of the Cox ( 1972 ) proportional intensity model 



Let X be a set of senders and J be a (not necessarily disjoint) set of receivers. For each sender 
i, let Xt(i) be a non-negative predictable process called the baseline intensity of sender i; let Jt(i) 
be a predictable finite subset of J called the receiver set of sender i. For each sender-receiver 
pair let x t (i,j) be a predictable locally bounded vector of covariates in W. Let /3q be 

an unknown vector of coefficients in W. For the remainder of this section, assume that each 
interaction has a single receiver. 

Given a multivariate counting process N on R + x 1 x J , we model its stochastic intensity as 



= A t (<) • exp{$x t (i, j)} ■ l{j e J t (i)}. 



(1) 



This model posits that sender i in I interacts with receiver j in Jt{i) at a baseline rate A t (i) 
modulated up or down according to the pair's covariate vector, x t (i,j). As Efron (1977) notes, 
the specific parametric form for the multiplier exp{f3gXt(i,j)} is not central to the theoreti- 
cal analysis, but this choice is amenable to computation and gives the parameter vector /3 a 
straightforward interpretation. Butts (2008) used a variant of this model to analyze repeated 



directed actions within social settings. 

The form of ([!]) is deceptively simple but remains flexible enough to be useful in practice. 
The model allows for homophily and group level effects via inclusion of covariates of the form 
"l{i and j belong to the same group}," where "group" is some observable trait like ethnicity, 
gender, or age group. Its real strength, though, is that x t (i,j) is allowed to be any predictable 
process, in particular x t (i,j) can depend on the history of interactions. To model reciprocation 
and transitivity in the interactions (with I = for example, choose appropriate values for 
and include relevant covariates in Xt(i,j): 



and 



ljinteraction j i occurred in [t — Ak,t)} 



l{for some h, interactions i — > h and h — > j occurred in [t — t)}. 



Any process measurable with respect to the predictable er-algebra is a valid covariate; this ex- 



cludes only covariates depending on the future or the immediate present. In Section 5.2 we detail 
specific covariates suitable for measuring homophily and network effects. 

Also note that despite presuming X and J to be fixed, our analysis allows senders and 
receivers to enter and leave the study during the observation period. The effective number of 
senders at time t is the set of i such that X t (i) ^ 0, which potentially varies with time. Likewise, 
the effective number of receivers is controlled through Jt(i). 



Following Cox (1975), we treat the baseline rate At(i) as a nuisance parameter and estimate 
the coefficient vector /3q using a partial likelihood. Specifically, let (ti,ii,ji),...,(t n ,i n ,j n ) be 
the sequence of observed interactions. The inference procedure is motivated by decomposing the 
full likelihood, L, as 

L(t 1 ,h,ji,t2,i2,j2,-- -,t n ,i ni j n ) 

= L(tt,h) L(t 2 ,i2\ti,ii,ji) L(j 2 \t 2 ,i2,ti,iiji) 

■ ' ' L(t n , i n \t n -i, i n —i,3n—ii ■ ■ • ^lj HiJi) L\j n \t n , i n , f n _i, i n —x ■ ■ ■ ii, i%, ji) 
L{ti,h) L{t 2 ,i 2 \ti,ii,ji) ■ ■ ■ L(t n , i n \tn-l,i n -l,jn-l,- ■ -h,h,ji) 

L(h\h,h) L(j2\t2,i2,h,ii,ji) ■ ■ ■ L(j n \t n ,i n ,t n -i,i n -i ...ti,h,ji) ; 
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the term comprised of the product of conditional likelihoods of j\ , . . . , j n is dubbed a partial 
likelihood. In continuous time, the log partial likelihood at time t, evaluated at /?, is 



log PL t (P) = \ P Tx tJim,: 3m) - log [ exp{(3 T x tm (i m , j)}) 



(2) 



In Section [3j we prove under suitable regularity conditions that the maximizer of log PL t {-) is a 
consistent estimator of /3q as t increases. 

The function log PL t (•) is concave, and so can be maximized via Newton's method or a 
gradient-based optimization approach (Nocedal and Wright 20061. These methods require one 
or both of the first two derivatives of logPL t (-), which can be expressed in terms of weighted 
means and covariances of the covariates. The weights are 



w t (/3,i,j) = exv{/3 T x t (i, j)} ■ l{j e J t (i)}, 
W t (p,i)= Y wt(fi,i,j). 



(3a) 
(3b) 



The inner sum in log PL t ((3) is Wt m (f3,i m ). The function logWt(-,i) has gradient E t (-,i) and 
Hessian Vt (•,«), given by 



V t (P,i) 



(4a) 
(4b) 



where a® 2 — a<£> a — aa T . Consequently, the gradient and negative Hessian of log PL t (-) are 

U t (fi) = V[logPLtC9)] = x t m (imJ m ) - E tm (p,i m ), (5a) 

t m <t 

I t (fi) = -V 2 ! [log PL t (P)] = J2 V tm (f3,i m ). (5b) 



We call Ut(/3o) the unnormalized score and It{Po) the observed information matrix. 

Note the dependence of these terms on time-varying covariates, which precludes using suf- 
ficient statistics and introduces additional complexity in maximizing log.PL t (-). For most large 
interaction datasets, existing computational routines for handling Cox models (e.g., the function 
coxph from the survival package for R (Therneau and Lumley 2009)) will not suffice. In Ap- 
pendix |Aj we describe a customized method for maximizing log PL t (•) that exploits sparsity in 
x t(i,j)- 



3. Consistency of maximum partial likelihood inference 



Under the model of Section [2] the maximum partial likelihood estimator (MPLE) is a natural 
estimate of /3o; the inverse Hessian of log PL t (-) evaluated at the MPLE is a natural estimate of 
its covariance matrix. We now give conditions under which these estimators are consistent. 

In the sampling regime where observation time t is fixed and the number of senders \I\ 
increases, Andersen and Gill s ( 1982 ) consistency proof for the Cox proportional hazards model in 
survival analysis extends to cover model ([I]) . This setting is natural in the context of clinical trial 
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data, where I corresponds to the set of patients under study, but does not meet the requirements 
typical of interaction data. For most interaction data we cannot control I and J , and the 
only way to collect more data is to increase the observation time. Cox (1972 1975) outlines 



a proof for general MPLE consistency that applies to our sampling regime, but his argument 
is heuristic; Wong's ( 1986 1 treatment is more rigorous but does not cover continuous or time- 
varying covariates. The general interaction data sampling regime warrants a new consistency 
proof. 

Our proof of consistency relies on rescaling time to make the interaction times uniform. To 
this end, define marginal processes N t (i) — YljejN t (i,j) and N t = J2iei Nt(i); also note that 
t n = sup{i : N t < n} is a stopping time and let T tn be the er-algebra of events prior to t n . The 
main idea of the proof is to change time from the original scale to a scale on which t n — t n i is 
proportional to n — n' . 



3.1. Assumptions 

Let B be a neighborhood of /3q. For a vector, a, let ||a|| denote its Euclidean norm; for a matrix, 
A, let \\A\\ denote its spectral norm, equal to the largest eigenvalue of (A T A) 1 ^ 2 . We require the 
following assumptions: 

Al. The covariates are uniformly square-integrable. That is, 



E 



sup||x t (z, j)\\ 2 



is bounded. 



A2. The integrated covariance function is well-behaved. When (3 £ B and a g [0, 1], as 

n — > oo, then with respect to the covariance function S ct (/3) we have that 



E 



V s {(3,i)W s (f3,i)\ s (i)ds 4s a (/3). 



A3. The interaction arrival times are finite. For each n, 

P{t„ < oo} = 1. 

A4. The variance function is equicontinuous. More precisely, 

■fVt„(-, i) : n > 1, i G x| is an equicontinuous family of functions. 



These technical assumptions are similar to those of Andersen and Gill ( 1982 ), who investigate 
specific settings in which their assumptions hold. Note that when ||xt(«,j)|| is bounded and 
Assumption jA|3]is in force, the remaining assumptions follow. 



3.2. Main results 

Assumptions A[l]-A[4] imply that the MPLE is consistent and asymptotically Gaussian, as shown 
by the following two theorems. 

Theorem 3.1. Let N be a multivariate counting process with stochastic intensity as given 
in ([T]) ; with true parameter vector @q. Let t n be the sequence of interaction times, and set U t {P) 
and It (f3) to be the gradient and negative Hessian of the log partial likelihood function as given 
respectively in (5a I and (5b). If assumptions AH^-A^ hold, then as n — > oo: 
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(a) nT 1 / 2 Ut,: (/?o) converges weakly to a Gaussian process on [0,1] with covariance function 
E a (A>); °" 

f&j «/ assumptions afoo /io/d, i/ien /or any consistent estimator /3 n of f3 , we have that 



sup 

*e[o,i] 



We don't actually require convergence of the whole sample path, but it turns out to be just as 
much effort to prove as convergence of the endpoint. 

Consistency is a direct consequence of Theorem |3.1[ and is proved in Appendix [Bj 



Theorem 3.2. Let N be a multivariate counting process with stochastic intensity as given 
in (IT]), with true parameter vector f3o. Let the log partial likelihood, \ogPL t { ), be as defined 
in (pi . Let t n be the sequence of interaction times. 

Assume that for f3 in a neighborhood of/3o that — ^V 2 [log PLt n (/?)] A £i(/3), where Si(-) is lo- 
cally Lipschitz and with smallest eigenvalue bounded away from zero. If n maximizes log PL tn (-) 
and conclusion |a]) of Theorem 3. 1 holds, then the following are true as n —¥ oo : 

(a) /3 n is a consistent estimator of (3 ; 

(b) \/n(/3 n — Po) converges weakly to a mean-zero Gaussian random variable with covariance 

pi(A))]- 1 . 



Proof (Theorem 3.1). O bserve that the process N t (i,j) has compensator A t (i, j) = J Q X s (i,j) ds; 
similarly, processes N t (i) and N t have compensators A t (i) = Y^j^j ^-t(i,j) and A t = J2 ieX A t (i). 
Correspondingly, define local martingales M t (i,j) = N t (i,j) — A t (i,j), M t (i) = N t (i) — A t (i), 
and M t = N t — A t ; also define 

H t (i,j) = x t {i,j) - E t ((3 ,i), 



where E t ([3,i) is as defined in (4a 



As observed by Andersen and Gill (1982), the score function Ut{-) evaluated at /3o has a 
simple representation in terms of these processes: 



= EE f H s (i,j)dM s (iJ), 

Af-T A i- T J 



since Ylj^j Jo H s (i,j) dA s (i,j) = 0. Since by Assumption a|IJ x is uniformly bounded, H is as 
well. Each term in the sum above is thus locally square integrable, with predictable covariation 



H a (i,j)dM s (i,j), / H s (i',j')dMS',j') 



H s (i,j)®H s (i',j')d(M(i,j),M(i!,f)) s 
[H s (i,j)] m dA s (i,j)-l{i = i',j=j'} 
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(Fleming and Harrington 1991| Thm. 2.4.3). There exists a sequence of stopping times localizing 
all M(i,j) simultaneously, so U((3q) is locally square integrable with predictable variation 



( u (Po)) t = E E f [ H s(i,j)] m dA s (i,j) 
= E [ V a (fio,i)dA t (i). 



(6) 



Now we rescale time. For each positive n define a discretized time-scaled version of the score 
that is right-continuous with limits from the left. The process is defined for times a in [0,1]; 



between times in [— , ^p), it takes the value Ut k ; i.e. 



(7) 



Part |a|): Lemma B.l shows that {7q™- > (/3 ) is a square-integrable martingale adapted to 



•7"t| a „j , the cr-algebra of events prior to t^ an j . Since it only depends on values at jump 



fr( n ) 

j a 

times, the quadratic variation of U^ n \Po) at time a is equal to the quadratic variation of U(f3o) 
at time t \ an \. There fore, since quadratic and predictable variation have the same limit when 
it exists dReboUedo 1 r 11 i ,>,„,. 

Lemma 



B.2 



1980 Prop. 1), assumption A2 implies that (^^ (n) (/?o)) a ^ S a (/3 ). 
in turn verifies that -^=IJ^((3 Q ) satisfies a Lindeberg condition necessary for the 
application of Rebolledo's (1980) Martingale Central Limit Theorem. Thus the process converges 
in distribution to a Gaussian process with covariance function E Q (/?o) as claimed. 

Part Recalling M t (i) = N t (i) — A t (z), combine (5b) and Q to obtain the relation 



E 



LcnJ 



V,(fio,i) dM s (i) = I hani (/3 ) - {U (n) Wa)) a . 



(8) 



When a G [0, 1], a repeated application of the triangle inequality to 



hi, 



0n) - Wh ani (M - i han] (A,)) - S Q (/?o) 



using the relation of ^ yields 



< 



E 



:|anj 



{V s (P n ,i) -V s (j3 ,i)}dN s (i] 



+ 



E 

i 

E 



V s (Po,i)dM s (i) 



V s ((3 ,i)dA s (i)-i: a {{3 ) 



We show that all three terms converge to zero in probability. The first term above is uniformly 
bounded by sup„/ , ; \\V tn , (f3 n ,i) — Vt n , (/?o, , which converges to zero since (3 n — > /3 by hypothesis 
of the theorem and {V t ,(•,«)} is an equicontinuous family by assumption AW| Lemma B.3 



proves, as a consequence of assumption A[3] and Lenglart's (19771 Inequality, that the second 
term converges to zero uniformly in a. The third term converges to zero by assumption jA[2J 
thereby concluding the proof. 
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4. Multicast interactions 

In Sections [2] and |3j we have assumed that each interaction involves a single sender and a single 
receiver. The model and corresponding asymptotic theory are sufficient to cover strictly pairwise 
directed interactions (e.g., phone calls), but they do not describe interactions that can involve 
multiple receivers (e.g., e-mail messages). We call an interaction involving a single sender and 
possibly multiple receivers a multicast interaction. 

In practice, multicast interactions are typically treated in an ad-hoc manner via duplication — 
for example, interaction i — > {ji,j2ij3} gets recorded as three separate pairwise interactions 
i ji , i —> ]2 ! and i —¥ js — giving rise to approximate likelihood and inference. In this section 
we explore the implications of using this approximate likelihood in the multicast setting. In 
particular we show it to be closely related to an extension of our model for directed pairwise 
interactions, and that the bias introduced by such an approximation can be quantified and in 
certain cases corrected. 

To this end, we introduce an extension of the model to the multicast setting. Let X, J ', 
Jt(i)> Xt(i,j), and @o be as in Section[2j For each sender i and positive integer L, let \ t (i;L) 
be a non-negative predictable process called the baseline L-receiver intensity of sender i. Let 
(ij., i\, Ji), . . . , (t n , i n , J n ) be the sequence of observed multicast interactions, with tuple (t, i, J) 
indicating that at time i, sender i interacted with receiver set J. For a set J, let |J| denote its 
cardinality. 

Consider a model for multicast interactions where the rate of interaction between sender i 
and receiver set J is 

X t (i, J) = X t (i; |J|) • exp { ^ ^x t {i, j)} ■ ]J l{j e J t (i)}. (9) 
The log partial likelihood at time t, evaluated at j3, is 



logFL t (/3) =Y,{Y, P Tx ^™,j) -log [ exp{5> T x tm (w)}]}- 

\J\ = \Jm\ 



(10) 



Suppose instead of using the multicast model, we use duplication to get pairwise interactions 
from the original multicast data. If we use the model of ([!]) for the pairwise data and ignore ties 
in the interaction times, we obtain an approximate partial likelihood: 



log PL t (P) = ^ | ^ P T x t Ji m , j) - \J m \ log [ 53 exp{/3 T x tm (w)}] }■ (H 



We claim log PL t (/3) approximates log PL t (f3). Heuristically, replacing the sum over all sets 
of size \J m \ in ( |10[ ) with a sum over all multisets of size |J m | (i.e., allowing duplicate elements 
from Jt m {im)), observe 

log[ 53 exp{53/3 T 2 ; tm (* m ,j)}] « log [( 53 exp {f3 T x tm (i m , j)}) lJml ] 
\J\=\J m \ 

= |J m |log[ 53 ex p{P Tx t m (im,j)}]- 



In this sens e, log PL t ((3) « log PL t (/3). Section 4.1 makes this statement more precise, and 



Section 4.2 analyzes the bias introduced by maximizing log PL t (/3) in lieu of log PL t {P). 
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4. 1. Approximation error 

Define the receiver set growth sequence 



Gr, 



E 

t m <t„ 



l{|Jm| > 1} 
\Jt m {im)\ 



(12) 



This__sequence plays a critical role in bounding the error introduced by replacing log PL with 
log PL. Note that when \Jt m {i m ) \ is constant G n has linear growth, but when \Jt m {i m )\ increases, 
G n often has sublinear growth. For example, the Cauchy-Schwartz inequality gives 



v l{|J m |>l} i^- 

3l l^(UI 2 . 



so if |j7t m («m)| = w(y / ro), then G n = C(y / n). Theorem 4.1 bounds the approximation error in 
terms of G n . 

Theorem 4.1. Let (t m , i m , J m ) be a sequence of observations from a multivariate point pro- 
cesses with intensity as given in (j9j). Assume that sup_*J]x t (i, j)\\ and sup m \ J m \ are bounded in 
probability. If log PL and log PL are as defined in (10 11), and G n is as defined in (12), then 
for (5 in a neighborhood of (3q, 



and 



V[log PL tn 08)] - V[log PL tn (/3)] = P (G n ), 



V 2 : [log PL t J(3)} - V 2 [\og PL tn {(3)] = P {G n ). 



PROOF. When J C J t (i), set X t (i, J) = J2 3 e,j x t(hj) an d w t (f3,i,J) = exp{f3 T X t (i, J)}. As 
a slight abuse of notation, when j is an element of Jt(i), take ll Wt(f3, i, j) n to mean Wt(f3, i, {j})- 
Define weights 



W t (J3,i;L)= ]T *',-/), 

\J\=L 

W t (fi,i;L)= [ MP,iJ) 



and note that the approximation error in log PL t (/3) comes from replacing W with W. 
The gradients of the weights are 



E t ([3,i;L) = V [log W t ([3,i;L)] = 



1 

W t (P,i;L) 



^2 w t (i,J)X t (i,J), 



JQJt(i). 
\J\=L 



E t ([3,i;L) =V [log W t ([3,i;L)] =L 



The second is the expectation of 2;=i x t{^3l) when J i , . . . , J i, are drawn independently and 
identically from Jti%) with weights u>t(/3, i, •); the first is the same expectation, conditional on 
the event that ji, ■ ■ ■ ,Jl are all unique. Let Pt,p.i-L and Pt,p,i-,L denote the two probability laws 
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for and let 'Et,p,i;L and Et t p t i-L denote expectations with respect to them, so that 

E t ((3,i;L) = Et,/3,i;L[Y,i=i x t(i,3l)] and 2? f (,8, i; L) = E t ,a,i-,L [J2i=i x t(i,3l)] • 

The bound on V[log PL tn (/?)]-V[log PL tn (/?)] derives from a bound on E t {f3, i; L)-E t (0, i; L). 
Write 

L L 



E 



t,/3,i;L 



Et(p, i; L) - E t {fi, i; L) = E t>g>i . L [ £ x t {i, j t ) 

1=1 1=1 

We define probability law P£ g i . L and associated random variables ji, . . . , ]l and Ji, ■ ■ ■ , Jl, such 
that marginally Ji , . . . , are distributed according to Pt,/3,i ; L and Ji,...,Jl are distributed 
according to Pt,p,i-Li but the variables are coupled to have nontrivial chance of agreeing. Then, 



Et(fl,i;L)-Et{p,i;L) 



E 



< 2L ■ 



i=i 



. L [^2x t (i,ji) -^x t {i 
i=i 

\\xt(h3)\\] ■ n,8,i; L {(h, ■ ■ • ,3L) f (31, ■ • • ,Jt)} 



sup 



The coupling is as follows: 

(a) Draw (Ji, . . . , Jl) according to Pt,p,i-,L- 

(b) If (Ji,...,J L ) are all unique, set (j 1 ,...,j L ) = (ji, . . . otherwise draw (ji,...,jz,) 
independently according to Et,p,i-L- 



With K = supjgj^^ \\xt(i, j)\\, Lemma 



C.l 



shows 



"'Ifl, 1 ;l{(3 3 l) + i3l,---,h)] < Qj 



L\ exp{4if ||/3||} 



\Jt(i)\ 



The resulting bound on || V[log PL t (j3)] — V[log PL t (/3)]|| now follows by expressing 
V [log PL t {P)]-V [log PL t (P)} = E tm (p,i m ;\J m \) - E tm (P,i m ;\J m \)- 



Using \\E t (P,i;L) -E t (P,i;L)\\ < K L 2 (L - 1 



t m <t 

;.; , , ^ cxp{4K 



\Jt(i)\ 



we get 



V[logPL t (/3)] -V[logP£ t (/3)] <Xexp{4X||/3||}- £ 



i m <( 



|J m | 2 (|Jm|-l) 

\Jt m {i m )\ 



We get the final bound for the gradients by replacing the numerators of the summands with 
sup m \J m \. 

Using the same methods, Lemma C.2 derives the bound on the difference in Hessians. 



4.2. Bias correction from the approximate partial likelihood 

When we use ad-hoc duplication, we are performing approximate inference under the multicast 
model of (J9j) . In practice, even if we explicitly want to use the multicast model, computing the 
partial likelihood of ( 10 ) involves an intractable combinatorial sum, so we may resort to using 
the approximation instead. Maximizing log PL t (-) instead of log PL t (-) introduces bias in the 
estimate of (3q. Theorem 4.2 (proved in Appendix O) bounds the bias. 
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Theorem 4.2. Under the setup of Theorem J^.l let j3 n maximize log PL tn (-) and let fi. 



maximize log PL tn (-). Suppose for all n that the Hessian ^-V 2 [log_PL fTi (-)] is uniformly locally 
Lipschitz and with smallest eigenvalue bounded away from zero in a neighborhood of /3„. // 
G n /n -> 0, then 

\\p n -f3 n \\ = P (G n /n). 

It remains to be shown that j3 n is a consistent estimator of (3q. This follows directly from 
the theory in Section [3j since the multicast case can be considered as a special case of the 
single receiver case: Consider the product I x N+ as the sender set, and the power set V{J) 
as the receiver set. For sender (i,L), the process X(i; L) is then the baseline send intensity, and 
{J C J t (i) ■ \J\ — L} is the receiver set; for sender-receiver pair ({i,L), J), vector x t{h j) 

is the covariate vector. Consistency of the MPLE now follows from Theorem"^ 
Suppose the true MPLE, f3 n , is a y^-consistent estimate of /?o- (Theorem 



3.2 



gives sufficient 

conditions.) Theorem 4.2 says that if \Jt m {im)\ grows fast enough to make G n smaller than 
Op(- v /rz), then the approximate MPLE, /3 n , is also -y/n-consistent. Moreover, if \/n(/3 n — f3 ) is 
asymptotically Gaussian, then y/n(f3 n — /?o) is asymptotically Gaussian with the same covariance 
matrix but possibly a different mean. Under enough regularity, — A [V 2 log PL tn (/3„)] consistently 
estimates the limiting covariance of y / n(/3„ — (3 ). To get the mean, we use a parametric bootstrap 
as follows. 



Assume that the conditions of Theorem 4.2 hold. The residual f3 n — (3q depends continu- 



ously on /?o and the covariate process Xt(i,j). Since (3 n is a consistent estimator of /?o, we can 
estimate the bias in /3 n via a parametric bootstrap. We generate a bootstrap replicate dataset 
{(tmi imi Jin)} by drawing Jm\ a random subset of Jt m {i m ) with size | J m \ whose elements are 

drawn proportional to w t (P n ,i m ,-)- We then get a bootstrap approximate MPLE, fin , by 

~(r) 

maximizing PL tn , where 

logPL^fi) = { E ^XtJimJ) - \&og [ J2 exp{/? T aUw)}]|- 



Note that Xt(i,j) is determined from the original dataset, not the bootstrap dataset. For each 
bootsta 
bias by 

hias = 

R 



n(r) 

bootstrap replicate, we get a residual p„ — p„. With R bootstrap replicates, we estimate the 

i * 

r=l 

We adjust for estimator bias by replacing f3 n with f3 n — bias. 



5. Fitting the model to a corporate e-mail network 

Recall from Section[T]that, given a set of interaction data triples (t, a primary modeling goal 
lies in determining which characteristics and behaviors of the senders and receivers are predictive 
of interaction. The modeling and inference framework introduced above enables us to directly 
address these concerns, as we now demonstrate through the analysis corporate e-mail network 
consisting of a large subset of the e-mail messages sent within the Enron corporation between 
1998 and 2002. These e-mail interaction data give rise to the following questions: 

Homophily To what extent are traits shared between individuals (gender, department, or se- 
niority) predictive of interaction behaviors? 
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Fig. 1. Message recipient counts. Each of the 21,635 points corresponds to a message, with y-value 
showing the base-2 logarithm of its number of recipients (ranging from 1-57). 



Network Effects To what extent are dyadic or even triadic network effects, as characterized 
by past interaction behaviors, relevant to predicting future interaction behaviors? 

We undertake our analysis using the multicast proportional intensity modeling framework 
developed in Sections [2] and [3] above, employing both static covariates reflecting actor traits, 
as well as dynamic covariates capturing network effects. The bootstrap technique introduced 
in Section [4] for multicast interactions is then used to reduce bias in the estimated effects, as 
well as to demonstrate that our asymptotic approximations are reasonable in this data modeling 
regime. We conclude this section with a discussion of the goodness of fit of our model in this 
setting, before turning our attention in Section [6] to an evaluation of the strength of homophily 
and network effects in explaining these data. 



5.1. Data and methods 

Our example analysis uses publicly available data from the Enron e-mail corpus (Cohen 2009), 
a large subset of the e-mail messages sent within the Enron corporation between 1998 and 2002, 
and made public as the result of a subpoena by the U.S. Federal Energy Regulatory Commission 
during an investigation into fraudulent accounting practices. We analyze the dataset compiled by 



Zhou et al. (2007), comprising 21,635 messages sent among 156 employees between November 13, 



1998 and June 21, 2002, along with the genders, seniorities, and departments of these employees. 

Approximately 30% of these messages have more than one recipient across their To, CC, and 
BCC fields, with a few messages having more than fifty recipients (see Fig. [I]). In the subsequent 
analysis, we exclude messages with more than 5 recipients — a subjectively-chosen cutoff that 
avoids e-mails sent en masse to large groups. 

We model these data using the multicast proportional intensity model of Section |2J with 
X = J = {1, 2, . . . , 156} and Jtit) = I\ {i}, and with static and dynamic covariates described 
in the next section. We fit the model by first maximizing the approximate log partial likelihood 
log PL t {p) of (|TT|), and then employing a parametric bootstrap to estimate and correct the 



Point Process Modeling for Directed Interaction Networks 



13 



Variate Characteristic of actor % Count 



L(i) member of the Legal department 25 

T(i) member of the Trading department 60 

J(i) seniority is Junior 82 

F(i) gender is Female 43 

LJ(i) L(i) ■ J(i) 12 

TJ(i) T(i) ■ J(i) 24 

LF(i) L(i) ■ F{i) 13 

TF{i) T(i) ■ F(i) 10 

JF(i) J{i) ■ F(i) 27 



Fig. 2. Actor-specific traits, with counts of how many actors share each trait 



resultant bias in parameter estimates. We calculate standard errors using the corresponding 
asymptotic theory. In the setting of this example, the interaction count is high, so the asymptotic 
framework developed in Sections [3] and [4] is natural. The main violation of assumptions A[l]-A|4] 
is that our covariates (described in Section 5.2 ) may in principle be unbounded; nevertheless, 
bootstrap calculations (described in Section 5.3) show that the asymptotic approximations we 
employ remain reasonable in this regime. 

We wrote custom software in the C programming language to fit the model using Newton's 
method. Our implementation exploits structure in the covariates to make the computational 
complexity of the fitting procedure roughly linear in the number of messages and the number of 
actors. Appendix |A"| describes the fitting procedure in detail. It took approximately 70 minutes 
to fit the full model using a standard desktop computer with a 2.4 GHz processor and 4GB of 
RAM. Each bootstrap replicate took approximately 20 minutes to generate and fit, using the 
original estimate as a starting point for the fitting algorithm. Most of the complexity in the 
fitting procedure is due to the inclusion of triadic covariates as described below; including only 
dyadic covariates reduces the fitting time to less than 10 seconds. 



5.2. Covariates 

The goal of our investigation is to assess the predictive ability of actor traits and network effects. 
To this end, we choose covariates that encode these traits and effects. Each covariate is encoded 
as a component of the time- varying dyad-dependent vector x t (i,j), which is linked to the rate of 
interaction between sender i and receiver j via the multicast proportional intensity model of (fTl) . 



5.2.1. Static covariates to measure homophily and group-level effects 

Consider first those actor traits that do not vary with time: the actors' genders, departments, 
and seniorities. We encode the traits of actor i and their second-order interactions using 9 
actor-dependent binary (0/1) variables, as described in Fig. [2] 

We encode all 90 identifiable second-order interactions between the traits of sender i and 
receiver j as components of x t (i, j). We do this by using variates of the form Y(j) and X(i) -Y(j), 
where X and Y are chosen from the list of 9 actor-dependent variates (L, T, J, F, LJ, etc.). 
We cannot identify the coefficients for covariates of the form X(i) -1; if a component of Xt(i, j) is 
the same for all values of j, then the corresponding component of /3 will not be identifiable since 
the product of the two can be absorbed into At (*) without changing the likelihood. Covariates 
of the form 1 • Y(j) pose no identifiability problems. 
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send i «- j 



i has sent j a message in the past 



i has received a message from j in the past 



2-send 



there exists an actor h such that i has sent h a mes- 
sage and h has sent j a message in the past 



2-receive 



j there exists an actor h such that i has received a 
message from h, and h has received a message from j 



sibling 



there exists an actor h such that h has sent i and j 
messages in the past 



cosibling h there exists an actor h such that h has received mes- 

/ \ sages from i and j 



Fig. 3. Dynamic covariates to measure network effects 



We measure homophily by way of the estimated coefficients for covariates of the form X(i) ■ 
X(i). For example, if the sum of the coefficients of f • J(j) and J(i) ■ J(j) is large and positive, 
this tells us that Junior employees exhibit homophily in their choice of message recipients. 



5.2.2. Dynamic covariates to measure network effects 

Static effects are useful for determining which traits are predictive of the relative rate of interac- 
tion between sender i and receiver j, but they do not shed light on network effects. Therefore, 
we are also interested in the predictive relevance of the dynamic network behaviors described in 
Fig. [3j The first two behaviors (send and receive) are "dyadic," involving exactly two actors, 
while the last four (2-send, 2-receive, sibling, and cosibling) are "triadic," involving exactly 
three actors. 

To measure first-order dependence of message exchange behavior on these network effects, we 
introduce binary indicators for all 6 effects as components of x t {i,j). These indicators depend 
on the sender i, the receiver, j, and the history of the process at the current time t. By the 
shorthand notation f{send}, we denote the indicator variable depending on sender i, receiver 
j, and the current time, t, which indicates if i has sent j a message before time t, with the 
remaining notations (f {receive}, f {2-receive}, etc.) defined similarly. 

To measure higher-order time dependence, we introduce additional covariates of the following 
form. We partition the interval [— 00, t) into K = 7 sub-intervals: 

[-00, *) = [*- A K , t - A*-*) U [t - Ak-1, t - Ajr-a) U • • • U {t - A t , t - A ) 

where 00 = > A#-_i > ■ ■ ■ > Ai > A = and "t — 00" is defined to be —00. Specifically, 
we set Afc = (7.5 minutes) x4 l for k = 1, . . . , K — 1 so that for k in this range Afc takes the 
values 30 minutes, 2 hours, 8 hours, 32 hours, 5.33 days, and 2f .33 days. 
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Define the half-open interval l\ k ■* = [t — Ak,t — Afc_i). For k = 1,...,K we define the 
dyadic effects 

send| fe) (i, j) = #{i ->■ j in / t (fe) }, 



receive 



(fc) (^j) = #{j^*in/ t (fe) }; 



for sender i, such that these covariates measure the number of messages sent to, and respectively 
received by, receiver j in time interval lj® . 

The dyadic effects have been defined in the manner above to enable easy interpretation of the 
corresponding coefficients. To illustrate this, for k — 1, . . . , K, suppose that (3k is the coefficient 
corresponding to send^^z, j). If we observe the message i — > j at time t, then for future time 
t' in the interval (t,t + AjJ, the rate X t '(i,j) will be multiplied be the factor e^ 1 ; for t' in the 
interval (t + Ai, t + Aa], the rate will be multiplied by e@ 2 ; this continues similarly, with the rate 
being multiplied by e@ k whenever t' € (t + Ak-i,t + A&]; equivalently, when Afc_i <t' — t < A^. 
Thus, the coefficients f3i, . . . , (3 k measure the effect of a "send event" and how this effect decays 
over time. We expect that /3k will decrease as k increases, but we do not enforce this constraint 
on the estimation procedure. 

The triadic effects involve pairs of messages. For k — 1, . . . , K and I = 1, . . . , K we define the 
triadic effects 

2-sendf j) = #{* h in 4 fe) } ' #{h -> J in 4°}, 
2-receivef = ^ #{h ^ i in I { t k) } ■ #{j ^ h in J t (i) }, 

sibling^ > l \i,j) =J2#{h^iin l[ k ] } ■ #{h -> j in J t (i) }, 
cosiblingf = ^ #{i -> h in /,, (fe) } • #{j ft in I { t l) }. 

For sender i and receiver j, the covariate 2-sendj fc '^(i, j) counts the pairs of messages such that 
for some h distinct from i and j, message i — > h occurred in interval l[ k ^ and message ft — > j 
occurred in interval J t ; the other covariates behave similarly. 

As with the dyadic effects, the triadic effects are designed so that their coefficients have 
a straightforward interpretation. However, since triadic effects involve pairs of messages, the 
interpretation is a bit more involved. We illustrate with the 2-send| fc '^(i, j) covariate having 
coefficient f3k.i for k = 1, . . . , K and I = 1, . . . , K. Take i and j to be two actors. Suppose at 
time t we observe the message h —> j. At this point, we look through the history of the process 
for all messages of the form i — ¥ ft; when paired with the original h — > j message, each of these 
defines a "2-send event." The other 2-send events are defined as follows: if at time s we observe 
the message i — > ft, then we enumerate all observed messages ft — > j in the history of the process; 
when each of these is paired with the original i — > h event it constitutes a 2-send event. A pair 
(s, t) can be associated with each 2-send event, where s is the time of the i — > h message and t is 
the time of the ft — > j message. At time t' after s and t, the existence of the 2-send event causes 
the sending rate Xf(i,j) to be multiplied by the factor e' 9 ' i ' ! , where t' G (s + A^-i, s + A&] and 
t' € (t + Ai-i,t + A/]. We expect /3t y i to decrease as k and I increase, though again we do not 
enforce this constraint in the fitting procedure. 



As previously noted, Butts (2008) used a variant of the proportional intensity model to 



capture interaction behavior in social settings. As such, a correspondence can be drawn between 
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certain of the covariates in Butts ( 2008 1 and those outlined above. If we set K = 1, then the send* 
covariate is equivalent to an unnormalized version of Butts' persistence covariate, and the sum 
(send f +receive t ) becomes an unnormalized version of Butts' preferential attachment covariate. 
For the triadic effects, Butts' OTP, ITP, ISP, and OSP covariates are analogous to the 2-send, 
2-receive, sibling, and cosibling covariates, although the exact definitions differ slightly. (For 
example, OTP t (i,j) is defined as X)/i mm [#{* — > ^ in (— oo,t)}, — > j in (— oo, £)}].) The 
versions of these covariates that we have introduced above, however, are designed to enable a more 
precise quantification of the time-dependence of network effects, as well as a more straightforward 
interpretation of the corresponding coefficients. 



5.3. Bootstrap bias correction 

Given the model specification, data, and covariates outlined above, we can estimate the parameter 
vector (3 under the approximate log partial likelihood of ( 11 ). Recall that the results of Section]!] 
bound the bias resulting from this approximate MPLE procedure as a function of the growth rate 
of the recipient set J over time. Here, treating the set J of 156 Enron employees as constant, 
the resultant bias is of order l/\J\ — and, since \J\ = 156 is on the order of the square root 
of the number 21,365 of messages in the dataset, we can correct this bias using the parametric 
bootstrap outlined at the end of Section [4] 

Fig. [4] summarizes the corresponding bootstrap residuals (from 300 replicates) for each com- 
ponent of the estimated parameter vector /3 n ; we can see from this figure that treating messages 
with multiple recipients as multiple single-recipient messages introduces bias on the order of the 
standard error for most of the coefficients. There is a pronounced negative bias in coefficient 
estimates for the dyadic effects, which is representative of a more general phenomenon. Sparsity 
in the components of Xt{i,j) (when considered as a function of j), when combined with high 
values of the corresponding entries /3, leads to negative bias in the coefficient estimates when 
there are messages with multiple recipients. The approximation in ( |10[ ) is worst when for some 
j*, weight wt m {i m , j*) far exceeds all other values of Wt m (i m , j), so that Wt m (i m , j*) ~ Wt m (i m ); 
when \ J m \ is large, the maximum of PL will avoid this situation by shrinking j3 where xt m (i m , j) 
is sparse. The dyadic covariates are particularly sparse, so the estimates for their coefficients are 
particularly vulnerable to this bias. 

Besides correcting for bias, the bootstrap simulations give us confidence that the asymptotic 
approximations are reasonable. The simulated standard errors are very close to those predicted by 
the theory, despite the norm ||cc t (z, j)\\2 being potentially unbounded, contrary to the assumptions 
of Theorem [311 



5.4. Goodness of fit 

Figure [5] details an ad-hoc analysis of deviance for the fitted model, showing how the approx- 
imate deviance (twice the approximate log partial likelihood) behaves as we add consecutive 
terms to the model. Group-level (static) effects account for 18% of the residual deviance and 
network effects account for 34%. The most dramatic decrease in the residual deviance comes 
from introducing the "Send" terms into the model; with only 8 degrees of freedom, they are able 
to account for 31% of the null deviance. The full model accounts for 54% of the null deviance. 

The residual deviance for the full model is approximately 4.8 times the residual degrees 
of freedom, and so an ad-hoc adjustment for this over-dispersion is to multiply the calculated 
standard errors by V^T8 sa 2.2. 

Note, however, that the residual deviance by itself is not adequate as a goodness-of-fit mea- 



sure, as it depends only on the estimated coefficients (see Section 4.4.5 of McCullagh and Nelder 
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Fig. 4. Enron bootstrap residuals. Summary of bootstrap residuals for estimated coefficients using the 
Enron dataset, normalized by estimated standard errors. The points (orange) show the means, and the 
error bars (purple) show one standard deviation. Coefficients are grouped by model term. 
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Fig. 5. Ad-hoc analysis of deviance for the Enron model. Residual deviance is defined as twice the 
approximate negative log partial likelihood from {TTTJ. The "Static" term contains the group level effects, 
and the "Dynamic" term contains the reciprocation effects. 
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( 1989 ) for discussion of a related problem for logistic regression with sparse data). To shed more 



light on how well the model fits these data, we use a normalized version of the martingale residual 
from Therneau et al. (1990), which we call a Pearson residual. Specifically, given /3, we define 



t m <t 



W tm 0,i) 



which is the expected number of i 
estimated by the Breslow 
analogous to that of 



3 



events given the estimated model, with J X t (i) dt 
(|l974| estim ate / W t 0,i)~ x J2j dN i,j(t)- The martingale residual 
(1990) is then defined as N t (i,j) — Nt(i,j); we nor- 

Pearson" residual: 



Therneau et al. 



malize this quantity by an estimate of its standard deviation to get a 
(Nt(i,j)-N t (hj))/{Nt(iJ)} 1/2 - 



Fig. 



Ga. 



shows a plot of iVoo(i,j) versus Noo(i,j) for two different models. In the "static" 
model, we only include the static covariatcs, while in the full ("static and dynamic") model, we 
also include all six types of network covariates. The fit for the static model is poor. For instance, 
it repeatedly predicts up to 200 i — > j events where we only observed 1 or 2; likewise, the model 
predicts 1 or fewer events where we observed up to 20. For the full model, which includes the 
dynamic covariates to account for network effects, the fit is much better, with the relationship 
between observed and expected interaction counts being roughly linear. 

Fig. 6b shows the Pearson residuals. For the full model, more than 95% are less than 1.15 
in absolute value, and the maximum absolute residual is 21.8. In contrast, the 95% quantile for 
the absolute residuals in the static model is at 3.7, and the maximum absolute residual is 125.0. 
The sum of squares if the residuals (X 2 ) is 16498 in the full model, over 29 times lower than that 
for the static model (480661). We don't know what a "reasonable" value for X 2 is; an ad-hoc 
degrees of freedom calculation suggests that for the full number this should be roughly equal to 
23974 = 156 • 155 - (90 + 2 • 8 + 2 • 50), which suggests that the full model is too aggressive. The 
bootstrap simulations confirm this, with 23974 being 5.2 standard deviances below the mean 
value X 2 for the bootstrap replicates. 

For a more parsimonious model, we might drop most of the triadic effects. Indeed, the 
model which only uses dyadic effects has a X 2 value of 23785, which is very close to 24074 = 
156 • 155 — (90 + 2-8), the approximate X 2 degrees of freedom for that model. However, at this 
stage we desire a model with the lowest possible bias, and also wish to acquire estimates for all 
of the network effects. 



6. Evaluating the strength of homophily and network effects 

Given the model fitting procedure and results described above, we may now evaluate the strength 
of homophily and network effects in predicting the interaction behavior observed in our data. 
Along the way, we compare the results to those obtained from alternative data analysis methods. 



6. 1. Assessing evidence for homophily in the Enron data 

The analyses of Section [5] above have established that our multicast proportional intensity model 
with chosen covariates is reasonably accurate in describing message recipient selection, condi- 
tional on the sender and the history of the process. Thus, we are justified in using the estimated 
coefficients from the model to assess the predictive ability of the corresponding covariates. 

Our first task is to gauge the predictive strength of homophily. To this end, Fig. [7] shows the 
estimated group-level coefficients for our model. Notably, homophily appears present for almost 
all main effects (Department, Seniority, and Gender): the estimated coefficients of L(j), T(j), 
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Fig. 6. Goodness of fit plots for two models 
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Fig. 7. Estimated coefficients and standard errors for group-level covariates of the form X(i) ■ Y(j), where 
i is the sender, j is the receiver, and X(i) and Y(j) are given in the row and column headings; dark 
coefficients are significant (via Wald test) at level 10" 3 . 



J(j), and. F(j) are all negative, while the sums of the estimated coefficients of the following pairs 
are all positive: T(j) and T(i) ■ T(j); J{j) and J{i) ■ J(j); F(j) and F(i) ■ F(j). Taking Gender 
as an example, the way this homophily effect manifests is as follows: if i is sending a message 
at time t, and person j is identical to person f except for Gender, then i is more likely to send 
to the similarly-gendered individual. The relative rate is exp(— 0.08 + 0.41) w 1.4 if i is Female, 
and exp(0.08) « 1.1 if i is Male. The characterization for other types of homophily is similar. 

Conspicuously, the only trait for which first-order homophily was absent was for the Le- 
gal group: the sum of the estimated coefficients of L(j) and L(i) ■ L(j) was negative (—0.30), 
indicating that employees in the Legal group exhibited anti-homophilic behavior. 



6.2. A comparative analysis based on contingency tables 

Were we interested only in homophily, we might be tempted to forgo the proportional intensity 
model of ([I]), and instead perform a contingency table analysis. However, as we now describe, 
such an analysis leads to very different conclusions about the predictive strength of homophily 
in our data. 

For example, suppose that we are interested in testing for seniority-based homophily. Ignoring 
network effects and other dependency, we might model ¥{i — > j \ i}, the probability of employee 
j being the recipient of a message given that employee i is the sender, by way of a multinomial 
logit model: 

P{z -> j | z} cx exp{pjJ(j) + pjjJ(i)J(j)}. 

In this setting, Junior- Junior homophily would manifest in a positive value of /3j + f3jj and 
Senior-Senior homophily would manifest in a negative value of 

Since there are nj = 82 Junior executives and ns = 74 Senior executives, and since the 
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Fig. 8. Estimated coefficients and standard errors for the contingency table-based analysis of Section [6^2| 

sender and receiver of a message are distinct, we have that 

e (l3j+Pjj)J{j) 



P{z -> j | i, J(i) = 1} = 
F{i^j\i,J(i) = 0},= 



(nj - l)eP J +P JJ + n s 



e 0jJ(f) 



njeP- 1 + (ns — 1) 



In turn, we compute the corresponding maximum likelihood coefficient estimates using the entries 
of a 2 x 2 table that counts the number of messages exchanged between each group: 





Receiver 


Sender 


Junior Senior 


Junior 


7972 5833 


Senior 


3977 14479 



The resultant estimates are f3j = —1.4 and fijj = 1.6, with (Wald) standard errors of about 0.02 
for each. Indeed, these are exactly the estimated coefficients we would obtain if we were to use 
the proportional intensity model \t(i,j) = A t (i) exp{/3j J(j) + P.jjJ(i)J(j)}, a result that holds 
more generally for non-time-varying covariates. 

One problem with this analysis is that we have marginalized over the other covariates (Gender 
and Department), potentially introducing a Simpson's paradox. This issue is easily rectified, 
however, by introducing covariates for senders and receivers, along with the corresponding second- 
order covariate interactions; Fig. [8] shows the resulting coefficient estimates. 

The far more important problem is that this analysis implicitly assumes each message to 
be independent and identically distributed, conditional on the sender of the message. This 
assumption is blatantly false: common sense tells us that if Junior A sends a message to Junior 
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Fig. 9. Estimated coefficients for network indicator effects 



B, then the next time B sends a message, B is more likely to choose A as a recipient. Any 
homophily effect present in these interaction data is thus likely to be exaggerated by reciprocation 
and other network effects. Indeed, comparing the contingency table-based estimates in Fig. [8] 
with the estimates from the proportional intensity model with network effects in Fig. [7| we can 
see that the coefficient estimates are much higher when we don't adjust for network effects. 
Thus even in cases where network effects themselves are not the object of primary interest, it is 
important to account for them when making inferences about the predictive strength of other 
covariates. 

6.3. Evaluating the importance of network effects 

In Section |6.1| we established that homophily was predictive of sending behavior, even after 
accounting for network effects. We now investigate the characteristics of these network effects 
and establish which of these effects are of greatest importance. 

To begin our analysis, Fig.[9]shows the estimated coefficients for the network indicator effects, 
giving a crude picture of the predictive importance of each network effect. The estimated coef- 
ficients are all positive, indicating that network effects strengthen the ties between individuals. 
The estimated coefficient for l{send} is almost four times larger than the other coefficients, 
agreeing with the general notion that one is most likely to do today the things one did yesterday. 
The next tier of indicator effects comprises ljreceive}, ljsibling}, and l{2-send}, whose es- 
timated coefficients are each around 0.80. Two triadic effects, l{2-receive} and l{cosibling}, 
are not significantly predictive of sending behavior. 

The estimated coefficients for the recency-dependent covariates, shown in Figs. [10] and [TT] 
give a more complete picture of network effects. Firstly, we can see that dyadic effects persist 
for over three weeks from the time a message is sent. The decay of the estimated coefficients 
is roughly exponential in the time elapsed, corresponding to a super-exponential decay in the 
relative sending rate. For 30 minutes after i sends a message to j, our estimated model predicts 
that the rate at which i sends to j will be multiplied by exp(l.lO) ~ 3.01, and the rate at which 
j sends to i will be multiplied by exp(1.73) w 5.66; then, between 30 minutes and 2 hours, the 
rates will be multiplied by exp(0.50) ps 1.65 and exp(0.69) « 1.99, respectively; this proceeds 
similarly until after 21.3 days, when the rates will be multiplied by exp(0.003) s» 1.003 and 
exp(O.OOl) « 1.001. 

Comparing the coefficients for send^ with those of receive^ we see that the latter are 
higher for k < 2, while the former are higher for k > 2. The corresponding intuition is that if 
A is sending a message up to two hours after receiving a message from B, then A is likely to 
respond to B, but after that, A is more likely to send to an individual whom A e-mailed at the 
time of receiving -B's original message (provided B and this other individual are identical in all 
other respects) . The time window during which reciprocation is more important than past habit 
is less than 8 hours. 

From Fig. |11[ we can see that the triadic effects are in general less pronounced and are much 
more short-lived than the dyadic effects. About 70% of the estimated coefficients are within 
3 standard errors of 0; even those that are significantly nonzero mostly lie between —0.05 and 
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Fig. 11. Estimated coefficients for triadic effects, with standard errors 
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+0.05. The exceptions are the coefficients for 2-sendj 1 ' 1 ' 1 (0.27), 2-sendj 2 ' 1 '' (0.13), 2-sendj 3 ' 1 '' 
(-0.12), 2-receivef ,2) (0.06), sibling^ 1 ' 1 ) (0.10). and siblingf ' 3) (0.14). We may interpret 
these coefficients as follows: 

2-send If A sent B a message in the last 2 hours and B sent C a message in the last 30 minutes, 
then A will send messages to C at a higher rate; however, if A sent B a message between 2 
and 8 hours ago and B sent C a message in the last 30 minutes, then A will send messages 
to C at a lower rate. 

2-receive If C sent B a message between 30 minutes and 2 hours ago, and if B sent A a message 
between 1.3 days and 5.3 days ago, then A will send messages to C at a higher rate. 

sibling If B sent A and C messages in the last 30 minutes, then A and C are more likely to 
send messages to each other; if B sent A a message between 30 minutes and 2 hours ago, 
and if B sent C a message between 2 hours and 8 hours ago, then A is more likely to send 
C a message. 

Given the emphasis on transitivity in the networks literature, it may at first seem discon- 
certing that most of the estimated coefficients for the time-dependent triadic effects are found 
to be insignificant in this analysis. However, one must bear in mind that, except for messages 
sent to them directly, individuals likely have no knowledge of their colleagues' e-mail activities, 
and therefore there is no reason why this activity should directly affect sending behaviors. Any 
predictive power the triadic effects have, then, must be due to correlation with exogenous factors. 
In this light, it is not surprising that the triadic effects are small and have small time horizons. 



6.4. Comparative analyses using actor-oriented and exponential random graph models 

The results of Section |6.3| provide a detailed view of the ways in which network effects can 
manifest themselves in data. To further bolster our confidence in these results, we compare our 
corresponding parameter estimates to those obtained using related network models. 



A number of dynamic network models exist in the literature, including Hanneke, Fu, and 



Xing s (2010) exponential random graph model with time- varying coefficients, and Kolar, Song. 



Ahmed, and Xing s (2010) time- varying stochastic block model. Alternative approaches explic- 



itly based on point processes, but excluding network effects and other covariate information, 
include that of Malmgren et al. ( 2009 1 , who model activity at the level of the individual using 
a hidden Markov model, and of Heard et al. (2010), who work at the level of the dyad, assum- 



ing a piecewise-constant interaction rate. 



actor-oriented model of Snijders (2001 2005), which we now detail in Section 6.4.1 



The closest match to our approach is given by the 

Then, in 



Section |6.4.2[ we provide a comparison to a static network analysis based on an exponential 
random graph model. 



6.4.1. Actor-oriented model analysis 

The actor-oriented model is designed for a sequence of snapshots G\ , G2, ■ ■ • , of network activity, 
where each Gt is an / x J binary matrix representing pairwise connectivity between actors at 
time t. The model is best suited for ties that persist in time, not instantaneous events; it treats 
the sequence of networks as a first-order Markov chain, with the distribution of G t determined by 
Gt-i- Actors are assumed to change their ties between times t— 1 and t to maximize a stochastic 
utility function that depends on characteristics of the overall network. Essentially, given that 
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the network is in state G, and that actor i is allowed to make a change, he will change his link 
to actor j according to 



s' 



p(j\i, G) cx exp { P.T.(G(i ^ j)) + £ P' S T' S (G \ {i -+ j})}, 

s=l s=l 

where T s and T' s are network statistics; (3 S and (5' s are unknown coefficients; G(i ~> j) denotes the 
network obtained either by adding link i j if it is absent or removing link i j if it is present; 
G \ {i — > j'} denotes the network obtained by removing i — ¥ j if it is present. Additionally, the 
rate at which actor i changes ties between observation times t — 1 and t is given by specified 



via another parametric model. (See |Snijders et aL (20101 for a more thorough introduction.) 
The change probability function p(j\i, G) plays a similar role to the multiplier exp{x t (i, j) T /3} 
in the proportional intensity model, and the change rate function X(i) plays a similar role to the 
baseline intensity At(i). 

For purposes of comparison, we specified a change probability model p(j\i,G) with network 



statistics analogous to those used in Section 5.2 and then we estimated coefficient sets and 



analogous to the coefficients in the proportional intensity model. We used the RSiena package 



(Ripley and Snijders 2011) to specify and fit the actor-oriented model after binning the Enron 
e-mail interaction data at regular intervals to obtain network snapshots Gi, G2, and G3. Here, 
Gt{hj) = 1 if message i — > j was observed in period t, and G t (i, j) = otherwise; periods 
1-3 correspond to consecutive four-month periods in the year 2001. The subset we looked at 
contains 60% of the messages in the Enron corpus. We chose this particular subset and temporal 
resolution partially for computational reasons, but also to make the network change statistics 
(Jaccard coefficients) within the range recommended by RSiena (near 0.3). Approximately 2 
hours' time was required to fit the model. 

We included the following terms, chosen to mimic the covariates detailed in Section [5~2j 



(a) Outdegree/density (out). This statistic counts the number of outgoing ties; it is analogous 
to our Xt(i), except that the rate is the same for each sender. 

(b) Group- level edge effects (traits). One covariate is included for each identifiable first-order 
interaction of the form X(i)Y(j), where i is the sender and j is the receiver; the covariate 
counts the number of edges i —> j with X(i)Y(j) = 1. These effects correspond to the 
group-level effects in our model. We would have liked to include second-order interactions 
as well, but (for computational reasons) were unable to fit the model with higher-order 
effects. 



(c) Outdegree/density endowment (outendow). This statistic counts the number of deleted 
outgoing ties; it corresponds to the negative of the send term in our model. 



(d) Reciprocity (recip). This statistic counts the number of reciprocal ties; i.e., edge sets of 
the form {i —> j,j —> i}. It corresponds to the receive term in our model. 



(e) 3-cycles (3cycle). This statistic counts the number of cyclic triples; i.e., edge sets of the 
form {h — > i,i — > j,j —> h}. It corresponds to the 2-receive term in our model. 



(f) Transitive triplets (ttriple). This statistic counts the number of transitive triples; i.e., 
edge sets of the form {h —> i,i —> h,h —> j}. It corresponds to the 2-send, sibling, and 
cosibling terms in our model. 
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(b) Network effects 



Fig. 12. Estimated effects for the actor-oriented model of Section 6.4.1 



Note that after binning the interaction counts to form network snapshots as required by the 
actor-oriented model, it is impossible to separate the 2-send, sibling, and cosibling effects. 
Further, the first-order Markov nature of the model restricts our ability to quantify the time 
decay of the dynamic network effects. Additionally, as we were unable to include second-order 
interactions between the group-level effects in this setting, we cannot compare the estimated 
coefficients for the first-order interactions with those obtained from the proportional intensity 
model. However, including these first-order interactions in the model mediates the confounding 
role of homophily in estimating the network effects, our primary focus for this comparison. 
Figure 12 shows the fitted coefficients for the actor-oriented model. We can see that the 



estimated network effects agree qualitatively with those in Fig. |9j as outdegree/density endow- 
ment has a negative coefficient while reciprocity and transitive triplets have positive coefficients. 
Further, the dyadic coefficients are larger than the triadic coefficients. A discrepancy between 
the two models is that 2-receive had a negligible effect in the proportional intensity model, 
while its analogue (3cycle) had a small negative effect in the actor-oriented model. One possi- 
ble explanation for this discrepancy is that treating all ties as binary forces a negative bias in 
otherwise-unimportant network effects. With the limitation that ties are binary, when actor i 
tries to maximize his stochastic utility, he is forbidden from reinforcing an existing i — > j link; he 
is more likely to link to an actor f for which link i — > j' is absent. To counteract this tendency, 
the coefficient of 3cycle is forced to be negative. 



6.4.2. Exponential random graph model analysis 

As a final comparison, we fit an exponential random graph model to our data. This class of 
models — one of the more popular for estimating the importance of network effects — specifies a 
probability distribution for a single directed graph represented by a binary matrix G. It supposes 
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that P{G = g} oc exp{^^ =1 T s (g)} : where T s (g) is a network statistic, for example the number 



of transitive triples in the graph. (See | Anderson et aL (1999) for a detailed survey.) 



To apply this form of model, we employed a reduction of our data to obtain a single directed 
graph G as follows. Based on an "elbow" in the empirical cumulative distribution function of 
message counts N^i.j) in our data, we chose a threshold of 10 sent messages and defined G by 

G(i,j) = l{N oo (i,j)>10}. 

Next, as in our comparison to the actor-oriented model, we chose terms in the model to mimic 



the covariat es from Section 5.2 We used the ergm software package to fit the model (Handcock 
et al. 2011 ), based on a Markov chain Monte Carlo sample size of 10 5 following a burn- in period 



of 10 b iterates. The covariates were as follows: 

(a) Sender effects (sender). One covariate is included for each sender, measuring the outdegree 
of the sender. The corresponding coefficient plays the role of A t (z) in our model. 

(b) Group-level edge effects (edgecov). One covariate is included for each identifiable first- 
order interaction of the form X(i)Y(j), where i is the sender and j is the receiver; the 
covariate counts the number of edges i — > j with X(i)Y(j) = 1. These effects correspond 
to the group-level effects in our model. We attempted to include second-order interactions 
as well, but were unable (for computational reasons) to fit the model with these terms. 

(c) Mutuality (mutual). This statistic counts the number of mutual ties; i.e., edge sets of the 
form {i — ¥ j,j — > i}, and corresponds to the receive term in our model. 

(d) Cyclic triples (ctriple). This statistic counts the number of cyclic triples; i.e., edge sets 
of the form {h — > i, i — > j,j — > h}, and corresponds to the 2-receive term in our model. 

(e) Transitive triples (ttriple). This statistic counts the number of transitive triples; i.e., 
edge sets of the form {h — > i, i — > j, h — > j}. It corresponds to the 2-send, sibling, and 
cosibling terms in our model. 

Note that reducing the interaction data to a single directed graph has important modeling 
consequences. As with the case of the snapshot-based actor-oriented model detailed above, it is 
impossible to separate the 2-send, sibling, and cosibling effects, and the inability to include 
second-order interactions between group-level effects precludes a direct comparison with the 
estimated group-level effects from the proportional intensity model. Further, for a single directed 
graph, there is no possibility of including a term corresponding to send, and it is impossible to 
quantify the timc-dcpcndcncc of the network effects. 

Figure |6.4.2 shows the fitted coefficients for the exponential random graph model. As with 



the case of the actor-oriented model considered in Section |6.4.1| above, the estimated network 
effects agree qualitatively with those of Fig. [9j Specifically, mutuality and transitive triples have 



positive effects, while cyclic triples have a negligible effect (in contrast to the case of Fig. 12 from 



Section 6.4.1) 
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Fig. 13. Estimated coefficients for the exponential random graph model of Section [6A2| The model also 
included a term for each sender (not shown); furthermore, standard errors returned for the group-level 
edge effects were nonsensical, and so are not reported. 



7. Conclusion 



Our analysis of the Enron corpus in Sections [5] and [6] above has demonstrated the ways in 
which static and dynamic effects manifest themselves in e-mail communication networks, and we 
expect similar conclusions to hold broadly for other types of directed interaction data. Relative to 
alternatives such as contingency table analyses, actor-oriented network models, and exponential 
random graph models, an advantage of our approach lies in its ability to model the given data 
directly, rather than in an aggregated form. We are able to adjust for network effects to get 
more reliable estimates of homophily, and by using continuous-time information we get precise 
quantification on the time-dependent behavior of the network effects. 



The foundation of our work is Cox's ( 1972 1 proportional intensity model and partial likelihood 



theory, tools which he first introduced almost forty years ago and which have been significantly 

19931 



developed since then (Cox 



tinussen and Scheike 2006 



Mar- 1 

Cook and Lawless 20071). These tools are used extensively in the 



1975 Fleming and Harrington 1991 Andersen et al 



context of survival analysis, but require further development for use in modeling interaction data. 
In this vein, we have extended the associated theory in two directions: first, we have provided 
results that are asymptotic in time rather than in the size of the population under study; and 
second, we have shown that treating multicast interactions via duplication leads to bias in the 
parameter estimates (which can in turn be corrected in certain regimes). 

We find the proportional intensity model with time- varying covariates to be particularly useful 
for modeling repeated directed interactions. The model is simple, flexible, and well established, 
and it facilitates investigation into which traits and behaviors are predictive of interaction. 
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A. Implementation 

To compute the maximum partial likelihood estimator, we use Newton's method as described in 
Boyd and Vandenberghel (120041). This requires an efficient algorithm for computing the gradient 



and Hessian of the log partial likelihood. For simplicity, we describe the case of strictly pairwise 
interactions with no ties in the interaction times. We use the notation from Section [2j with the 
model from ([!]) and the partial likelihood from |2]). Recall that Xt(i,j) is in ]R P . Assume that 
\T\ = / and \ J\ = J. 

Suppose (ii, zi, ji), . . . , (t n ,i n ,j n ) is the sequence of observed interactions. Set n(i) = #{i m '■ 
im — £}• The partial likelihood factors into a product of terms, one for each sender: 



PLt{p)= n pL t(M, p£tCM=n 



iex 



t m <t 



This factorization allows us to compute log PL t (/?) and its derivatives by computing the sender- 
specific terms in parallel and then adding them together. 

The gradient and Hessian of the sender-specific log partial likelihood are respectively 



t m <t, 

i m —i 



t m <t, 
i m —i 



-V 2 [log PL t (p,i)} = V tm (P,i), 
t m <t, 



(15a) 
(15b) 



where E t (f3,i) and V t (f3,i) are as defined in (5a) and (5b I. When Xt(i,j) is constant over 
time, sufficient statistics for f3 imply that these formulae simplify. Otherwise, computing the 
first two derivatives of log PL tn (j3) necessitates iterating over all messages, potentially requiring 
time 0(nJp 2 ). For small- to medium-sized datasets, this is manageable, but for large network 
datasets it can become prohibitive. In the sequel we show how to exploit sparsity to drastically 
reduce the computation time. 



A.1. Initial values 

We will need to compute Wo(/3,i), Wo((3,i,j), E ((3,i), and Vo(/3, i) for all values of i and j. 
In the worst case, doing so will take 0(1 Jp 2 ). However, often the senders belong to a small 
number, I <C / of groups such that if i and i' are in the same group, then the corresponding 
values of Wq, tto, Eq, and Vq are the same, reducing the total complexity to 0(1 Jp 2 ). The 
remaining complexity estimates assume that the initial values have all been pre-computed. 



A.2. Exploiting sparsity 

We first decompose x into its static (non-time-varying) and dynamic parts as follows: 

x t (i,j) = x (i,j) + Ax t (i,j). (16) 

Typically, we can quickly compute the dynamic part Axt(i,j) at each observed message time by 
incrementally updating it. Further, Axt(i,j), is zero for most pairs — often Ax t (i,j) is zero 



30 P. O. Perry and P. J. Wolfe 

unless i and j have a common acquaintance or they have interacted in the past. For convenience, 
set J (i) = J. Let 

J(i) = {j £ J : j £ J t (i) and Ax t (i,j) ^ for some t } U { j £ J : j g J t (i) for some t }. 

For fixed t and i, assume that computing Axt(i, j) for all values of j takes amortized time 0(dJ). 
Since Jo{i) = J ', we have that 

w t ((3,i,j) = w (/3,i,j)-e^p{p T Ax t (i,j)} ■ l{j £ J t (i)} 
= w (/3,i,j) + Aw t (i,j); 

w t (p,i) = w (p,i) + Aw t(hj); 



where 



Aw t {i,j) = w Q (P,i, j)[exp{/? T A:c t (z, £ J t (i)} - 1]; 

here we have used that Awt(i, j) is zero unless j £ J{i)- Write 

then, defining 

W {P,i) . . Aw t {P,i,j) 

W) = 7i7Z5 A-Kt{P-,i,3) ~ 



we can express 7r t (/3, as follows: 

= lt(i)^o(l3,i,j) + Air t (f3,i,j). 

Moreover, given the initial values Wo(/J, i) and Wq(/3, we can efficiently keep track of 7t(«) 
and A7r t (/3, z, j): for any i and t, it takes amortized time O(Jdp) to evaluate 7t(i) and all values 
of Aw t (i,j) as j varies. 



A3. Computing the gradient 

In evaluating the gradient of the log partial likelihood as given by (15a), the sum J2 m x t m (hjm) 
can be computed in time 0(np), while the computationally expensive term is J2 m (A *m)< I n 
the sequel we show how to exploit sparsity in x to reduce the associated computational overhead. 

To simplify the notation, we suppress the dependence of all quantities on j3 and i. Consider 
7r t and Air t to be vectors of length J, and write 

Tit = 7t7r + A?r t . 

Also, let Xt = X t (i) and AXj = AX t (i) be the J x p matrices whose jth rows are Xt(i,j) and 
Ax t (i,j), respectively, so that 

X t = X + AX t . 

Using these expressions, we obtain 



E t = Xjnt = j t E + X^Airt + AXjn, 
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and thus, 

E ^ = ( E 7 t „> +^ T ( e a**.) + E 

mm mm 
i m —i im—i iin—i im—i 

Taking advantage of the sparsity in AX t and Aw t , computing the three sums on the right 
hand side takes time 0(n(i) J dp). Once the sums are known, the multiplication (^2^ft m j^o 

takes time 0(p), and the multiplication Xq ( A7r* m ) takes time 0(J p). Thus, we can compute 
Y) r m E tm in time O (n(i) J dp). Computing these terms separately for each i and then summing 
over all i to get the total gradient requires time 0(nJ dp + I p). 



A.4. Computing the Hessian 

Computing the Hessian according to (15b) proceeds similarly to the case of the gradient. We 
need to efficiently compute the sum 2J m Vt m {0,i m )] while a naive computation requires time 
0(n J p 2 ), this can be significantly improved by exploiting sparsity in x t (i,j). 

To this end, define n t (/3, i) to be the J x J diagonal matrix with [II t (/3, = 7r t (/3, i, j), and 
set An t (/3,i) = II t (/3,«) — n (/3,i). Suppressing the dependence on /3 and i, we have 

V t =Xj[Il t ~7r t 7rJ}X t 

= Xj[U t -n t TTj}X Q + AX?lTl t -Tr t n?]X 

+ X^[U t - TT t T:J}AX t + Al t T [n 4 - ^J]AX t . 

The first of these terms reduces to 



X T [n t -^7r t T ]X =ltV + lt {l-lt)E Q E^ - E ( lt ATT t ) T X l 



- X ( 7t Avr t )^ T + ^o T [ An *- A^ t A^ t T ]X , 

and the second can be expressed as 

AX t T [II 4 - ir t nJ]X = { lt AX t Ti t )E^ + Al t T p f + 7: t ATrJ}X Q . 

The third term is the transpose of the second; the fourth does not simplify. 

To compute the sum Y] m Vt , we only accumulate sums of terms that change with time: 

— im—i m 

7u Air t , 7 t(l - 7 t), JtAnt, ATT t AnJ, lt AX t n tl AX?[n t + 7r t A7r t T ], and Al t T pt - TT t nJ]AX t . 
Doing so takes time 0(Jdp 2 ) for each time increment. As with the gradient computation, we 
compute the sums separately for each i and then sum over all i, so that the total computation 
time is 0(nJ dp 2 + Ip 2 ). 



A. 5. Total computation time 

To perform one Newton step in maximization of the log partial likelihood of ([2]), we must first 
compute the gradient and Hessian of the log partial likelihood at the current value of /3, and then 
compute the inverse of the Hessian and its product with the gradient. Once we have the Hessian, 
computing its inverse takes time 0(p 3 ). Typically, it takes 0(1) Newton steps to compute the 
maximum of a convex function (the constant is often below 30). The key factors in determining 
the computation time using the factors laid out above are /, J, and d: 

• The value of / depends on the structure of xo(i,j). Specifically, / is equal to the number 
of distinct values of the matrix X (i) as i varies. For the Enron data, we have that / = 12: 
each sender belongs to one of 12 groups determined by group (L/T/O), seniority (J/S), 
and gender (F/M), and so the matrix Xo(i) depends only on the group of i. 
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• The value of J depends on the sparsity of Xt(i,j). If Xt(i,j) includes only dyadic network 
effects, then J will typically be of size 0(1) or 0{J a ) for a fractional value a; when we 
add triadic effects, this size will typically grow to at most 0(J 2a ). 

• The value of d depends on further structure in xt(i,j). In our implementation, d = 0(1) 
for dyadic effects and d = 0(J) for triadic effects. 

The total computational cost per Newton step is thus 0(I J p 2 + n J dp 2 + 1 p 2 +p 3 ), with the 
significance of this expression being that it is nearly linear in /, J, and n. Thus, the algorithm 
scales naturally to large datasets. 



B. Results from Section H] 



B. 1 . Lemmas from the proof of Theorem 3. 1 



3.1 



under assumption the process U^^q) 
from ^ is a square-integrable martingale adapted to J-a = J~t lanj . 



Lemma B.l. Using the notation of Theorem 



Proof. The conditional expectation property holds provided E[[/ tii (/?o) | -^-i] = ^t„_i (AO- 
Define K = sup tiJ \\x t {i, j)\\. Note that \\H t (iJ)\\ < 2K. Thus, 

WtAtMW <2K(N tMn +A tAtn ), 

so that 



E 



SUp \\UtAt. 

t 



.,(A))II 



< 



(EK 2 ) 



ENf 



EA 



2 \V 2 



By assumption ^[T] EK 2 is finite, and by construction, N tn is bounded. Since N tAtri is a counting 



process, EA, is finite, too (this follows from results in Section 2.3 of Fleming and Harrington 



(1991)). Thus, UtM n (Po) is uniformly integrable. The Optional Sampling Theorem now ap- 
plies to give the conditional expectation property of C/(™)(/3 ). For square integrability, note 
su Pl < m <„E||;y tm || 2 < E [sup, ||C/ tAt „(/? )|| 2 ] • 

Lemma B.2. Using the notation of Theorem \3.1\ under assumption the Lindeberg con- 
dition for Rebolledo's (1980) Central Limit Theorem is satisfied: for any positive e, 



E 



\H s (i,j)\\ 2 l{\\H s (i,j)\\ > v/i £ }dA s (i,j)4o. 



Proof. With K — sup t i j ||xt(i,j)|| as above, the integral is bounded by 4K 2 l{n 1 / 2 K > 

e/2} • Since EK 2 < oo by assumption a[TJ the first term converges to zero in probability. 
Since EA tn = EN tn = n, the product of the two also converges to zero in probability. Thus, the 
Lindeberg condition is satisfied. 

Lemma B.3. Using the notation of Theorem 3.1 , under assumptions j^^and j^^we have that 

rt, 



E 



V s ((3 ,i)dM s (i) 



4o 



uniformly in a. 
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Proof. Lenglart's ( |1977[ ) Inequality and assumption A[3j imply that for any positive p and 

S m f 1 



sup 

te[o,t„ 



V / K(/3 ,z)dM s (z) 
, Jo 



> 



< 



|v;(A).*)lrdA.(*)>* 



(see Fleming and Harrington ( 1991 Cor. 3.4.1) for a related proof). As in the proof of Lemma B.l 
set K = sup t ^ ■ ||x t (i, _7 ) 1 1 . The sum is bounded by ■ Since rT^^K 2 A by assumption 

tmd EAt n = n, the right-hand side of the inequality converges to -4 ■ Since 5 is arbitrary, the 
t-hand side must converge to zero. 



B.2. Proof of Theorem \3^ 

We follow Haberman's ( |1977[ ) approach to proving consistency, which relies on Kantorovich's 
( 1948 ) analysis of Newton's method. Tapia ( 1971 ) gives an elementary proof of the Kantorovich 



Theorem. We state a weak form of the result as a lemma. 

Lemma B.4 (Kantorovich Theorem). Let P(x) = be a general system of nonlinear 
equations, where P is a map between two Banach spaces. Let P'(x) denote the Jacobian (Frechet 
differential) of P at x, assumed to exist in Dq, a convex open neighborhood of Xq. Assume the 
following: 

(a) \\[P'(x )]-'\\<B, 

(b) iitP'Oro)]-^)!!^, 

(c) \\P'(x) ~ P'(y)\\ < K\\x - y\\, for all x and y in D , 
with h = BKrj < \ . 

Let fi* = {x : \\x — xq\\ < 27/}. If C Dq, then the Newton iterates, Xk+i = Xk — 
[P'(xk)]~ 1 P(xk), are well defined, remain in and converge to x* in fl* such that P(x*) = 0. 
In addition, 

77 {2hf k 



Xk 



< 



h 2 k 



0,1,2, 



PROOF (Theorem 3.2 ) . Set U t {-) and I t {-) to be the gradient and negative Hessian of the log 



partial likelihood, as defined in ([5a 5b). Since it(/3) is a sum of rank-one matrices with positive 
weights, it is positive semi-definite, and logPL t (-) is a concave function. By the assumption 
that the smallest eigenvalue of Si(-) is bounded away from zero in a neighborhood of /?o, for n 
sufficiently large, if log PL t (-) has a local maximum in that neighborhood then it must be the 
unique global maximum. 

We find the local maximum by applying Newton's method to the gradient of ^ log PL trl (-), 
taking /3q as the initial iterate. Define 

Zn = -[±ItM]- 1 [±U tn (f3 )). 



The first Newton iterate, f3 n ,i, is equal to /3q — Z n . Part (b) of Theorem 3.1 and the assumptions 
of the theorem imply [^It n (A))] -1 exists for n large enough, so that Z n is well-defined. Moreover, 

Part (a) of Theorem 3_1 and Slutsky's Theorem imply Z„ 4 and y/nZ n 4 jV(0, [Ei(/3 )] _1 ). 

Apply Kantorovich's Theorem to bound ||/3„ — /?o|| and ||/3„ — /3 n ,i||- By assumption, there 
exists a neighborhood of (3q, say Dq, and finite K and B, such that ||-^t„(/3) — hlt„ < 
K\\f3 - p\\ and ||i[J tf ,(A))]~ 1 || < B for e D . Define 77„ = \\Z n \\ and h n = BKr, n , noting 
that h n and rj n are size 0-p{n- 1/2 ). Thus, for n large enough, 
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(a) ||/3 n -A)|| <2t7„4 0, 

(b) Vn||/3 n - (A) - Z n )\\ < 2^ Vn h n 4 0. 

Thus, /?„ — > /3 , and ^n(/3 n — /3 ) and ^/nZ n converge weakly to the same limit. 

C. Results from Section @] 

Lemma C.l. Using the notation and assumptions of Theorem \4-l\ 

'L\ exp{4Fr||/3|j} 



■ ■ • Jl) ^ (Ji, • • • , Jl)} < 



\Jt®\ 



- t,/9,t;. 



where K = sup t ||xt(z, J)||. 
Proof. 

■■■i3l)t l (Ji, • • • = p t*/3 :l; i{ji, ■■■ ,Jl has a duplicate} 

<E p *w{j*=j»} 

Note exp{-i4: ||/3||} < w t (f3,ij) < exp{K\\ /3\\}, so that 



fc<; 



1 2 exp{4if ||/?||} 
Lemma C.2. Using the notation and assumptions of Theorem\4-l\ 



V 2 [logPL t (/?)]-V 2 [logPL t (/3)] <2K 2 exp{4X||/3||}. £ l ' 7 "' ( l ' 7 '" 1 



t m <t 



\Jt m {im)\ 



Proof. The argument is similar to the bound on the difference in gradients in the proof of 



Theorem 4.1 The Hessians of the weights are 
V t (P,i]L) = V 2 [log W t ((3,i;L)] 



= WJbTL) £ w t (P,i,J)[X t (i,J)-E t (l3,i;L) 

t[P: ' ' JCJ t (t), 
\J\=L 

V t (/3,i;L) = V 2 [\ogW t (P,i;L)] 



L ■ 



Ejej- t (i)^(A«,i) x t (i,j)- ±fit{0,i;L) 



02 



The first is the covariance matrix of %t(i, ji) under f't,/3,i-L', the second is the covariance 

matrix of the same quantity under F^p^L- The result follows in the same manner as in the proof 
of Theorem 14.11 The relevant intermediate bound is 



V t (P,i;L)-V t (fi,i;L) 



\Jt{l)\ 
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Proof (Theorem |4.2[) . We know that Newton's method applied to - log PL tn (•) converges 



to j3 n after sufficiently ma ny it erations. We employ j3 n as the initial iterate and use the Kan- 



torovich Theorem (Lemma B.4| to bound \\j3. 



In the notation of the lemma, P(-) is the gradient of ^ log PL tn (-) and P'(-) is its Hessian. 



The conditions of Theorem 4.2 imply assumptions (a) and (c) hold uniformly in n for some finite 
B and K. Set 

V 2 [A log PL tn n )] 1 _1 fv [i log PL tn n )] 



4.1 



and the boundedness of the 



and set h n — BKrj n . Since V[logPL tri (/3„)] = 0, Theorem 
inverse Hessian imply r\ n — Op(G n /n). Therefore, for n large enough 

HA. " k\\ < \ { ~^- = 2 Vn = OAG n /n). 
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